344 research outputs found
Query-driven document partitioning and collection selection
Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model: each document is represented as a list recording the queries for which the document is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results and are used for collection selection. We show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, with respect to a round-robin assignment. Second, we show that by performing collection selection through matching the query against the existing query clusters and then choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve on CORI precision by a factor between 11% and 15%. As a side result, we show a way to identify rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
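The query-vectors representation described above can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's code: the 1/rank weighting and the toy query log are assumptions chosen for clarity; the paper builds these vectors from a real query log and then co-clusters queries and documents.

```python
# Hedged sketch of the query-vectors model: each document is described by
# the queries that retrieve it, weighted here by the reciprocal of its
# rank in their result lists (an illustrative weighting choice).

def query_vectors(query_results):
    """query_results: query -> ranked list of matching document ids."""
    vectors = {}
    for q, ranked_docs in query_results.items():
        for rank, doc in enumerate(ranked_docs, start=1):
            vectors.setdefault(doc, {})[q] = 1.0 / rank
    return vectors

# Toy query log: two queries with their ranked result lists.
log = {
    "hubble tension": ["doc1", "doc2"],
    "dark energy":    ["doc2", "doc3"],
}
vecs = query_vectors(log)
print(vecs["doc2"])   # matched by both queries, at ranks 2 and 1
```

Documents never returned by any query get no vector at all, which is exactly the "rarely asked-for" set the abstract proposes to separate from the main index.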
A look at the Hubble speed from first principles
We introduce a novel way of measuring $H_0$ from a combination of independent geometrical datasets, namely Supernovae, Baryon Acoustic Oscillations and Cosmic Chronometers, without the need of calibration nor the choice of a cosmological model. Our method builds on the \emph{distance duality relation}, which sets the ratio of luminosity and angular diameter distances to a fixed scaling with redshift for any metric theory of gravity with standard photon propagation. In our analysis of the data we employ Gaussian Process algorithms to obtain constraints that are independent of the underlying cosmological model. We find a measurement of $H_0$ in km/s/Mpc, showing that it is possible to constrain $H_0$ with minimal assumptions. While competitive with current astrophysical and cosmological constraints, our result is not precise enough to solve the Hubble tension in a definitive way. However, we uncover some interesting features that hint at a twofold solution of the tension.
Comment: 7 pages, 5 figures. Any comments are most welcome.
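The distance duality relation the abstract builds on can be written explicitly. For any metric theory of gravity with standard photon propagation, the luminosity distance $d_L$ and the angular diameter distance $d_A$ obey Etherington's relation:

```latex
% Etherington distance duality relation
d_L(z) = (1+z)^2 \, d_A(z)
```

Because Supernovae constrain (uncalibrated) $d_L(z)$ while Baryon Acoustic Oscillations constrain $d_A(z)$, this fixed redshift scaling is what lets the datasets be tied together without an external calibration or a specific cosmological model, as the abstract describes.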
Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking
Machine-learned models are often described as "black boxes". In many real-world applications, however, models may have to sacrifice predictive power in favour of human-interpretability. When this is the case, feature engineering
becomes a crucial task, which requires significant and time-consuming human
effort. Whilst some features are inherently static, representing properties
that cannot be influenced (e.g., the age of an individual), others capture
characteristics that could be adjusted (e.g., the daily amount of carbohydrates
taken). Nonetheless, once a model is learned from the data, each prediction it makes on new instances is irreversible, as every instance is assumed to be a static point located in the chosen feature space. There are many circumstances, however,
where it is important to understand (i) why a model outputs a certain
prediction on a given instance, (ii) which adjustable features of that instance
should be modified, and finally (iii) how to alter such a prediction when the
mutated instance is input back to the model. In this paper, we present a
technique that exploits the internals of a tree-based ensemble classifier to
offer recommendations for transforming true negative instances into positively
predicted ones. We demonstrate the validity of our approach using an online
advertising application. First, we design a Random Forest classifier that
effectively separates two types of ads: low (negative) and high
(positive) quality ads (instances). Then, we introduce an algorithm that
provides recommendations that aim to transform a low quality ad (negative
instance) into a high quality one (positive instance). Finally, we evaluate our
approach on a subset of the active inventory of a large ad network, Yahoo
Gemini.
Comment: 10 pages, KDD 201
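The core recommendation step can be sketched as follows. This is an illustrative reconstruction under assumptions, not the paper's implementation: positive paths are given directly as lists of threshold conditions, the epsilon offset and the Euclidean cost are example choices, and the toy ensemble is made up.

```python
# Hedged sketch of actionable feature tweaking for a tree ensemble.
# A "positive path" is a list of (feature_index, threshold, direction)
# conditions leading to a positive leaf; direction is "<=" or ">".

def tweak_for_path(x, path, eps=0.1):
    """Minimally perturb x so it satisfies every condition on the path."""
    x_new = list(x)
    for feat, thr, direction in path:
        if direction == "<=" and not x_new[feat] <= thr:
            x_new[feat] = thr - eps
        elif direction == ">" and not x_new[feat] > thr:
            x_new[feat] = thr + eps
    return x_new

def cost(x, x_new):
    """Euclidean distance used as the tweaking cost (one possible choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_new)) ** 0.5

def best_tweak(x, positive_paths, eps=0.1):
    """Return the cheapest epsilon-satisfactory transformation of x."""
    candidates = [tweak_for_path(x, p, eps) for p in positive_paths]
    return min(candidates, key=lambda c: cost(x, c))

# Toy ensemble with two positive paths over features 0 and 1.
paths = [
    [(0, 0.5, ">"), (1, 0.3, "<=")],
    [(0, 0.8, ">")],
]
x = [0.2, 0.9]          # a "negative" instance
print(best_tweak(x, paths))
```

Each candidate is the instance moved just past the thresholds of one positive path; picking the minimum-cost candidate yields the recommendation with the smallest suggested change.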
Efficient Diversification of Web Search Results
In this paper we analyze the efficiency of various search results diversification methods. While the efficacy of diversification approaches has been investigated in depth in the past, response time and scalability issues have rarely been addressed. We thus propose a unified framework for studying the performance and feasibility of result diversification solutions. First, we define a new methodology for detecting when, and how, query results need to be diversified. To this purpose, we rely on the concept of "query refinement" to estimate the probability that a query is ambiguous. Then, relying on this novel ambiguity detection method, we deploy and compare, on a standard test set, three different diversification methods: IASelect, xQuAD, and OptSelect. While the first two are recent state-of-the-art proposals, the latter is an original algorithm introduced in this paper. We evaluate both the efficiency and the effectiveness of our approach against its competitors by using the standard TREC Web diversification track testbed. Results show that OptSelect runs two orders of magnitude faster than the other two state-of-the-art approaches while obtaining comparable diversification effectiveness.
Comment: VLDB201
Tour recommendation for groups
Consider a group of people who are visiting a major touristic city, such as New York, Paris, or Rome. It is reasonable to assume that each member of the group has his or her own interests or preferences about places to visit, which in general may differ from those of other members. Still, people almost always want to hang out together, and so the following question naturally arises: What is the best tour that the group could take together in the city? This problem raises several challenges, ranging from understanding people's expected attitudes towards potential points of interest, to modeling and providing good and viable solutions. Formulating this problem is challenging because of its multiple competing objectives: for example, making the entire group as happy as possible generally conflicts with the objective that no member becomes disappointed. In this paper, we address the algorithmic implications of the above problem by providing various formulations that take into account the overall group satisfaction, the individual satisfaction, and the length of the tour. We then study the computational complexity of these formulations, provide effective and efficient practical algorithms, and, finally, evaluate them on datasets constructed from real city data.
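One way to make the competing objectives concrete is a small brute-force sketch. This is an illustrative formulation under assumptions, not the paper's models: the additive preference scores, the alpha blend of total versus minimum satisfaction, and the length budget are all toy choices.

```python
# Hedged sketch of one possible group-tour objective: choose a subset of
# points of interest (POIs) under a tour-length budget, trading off
# total group satisfaction (sum) against fairness (the least-satisfied
# member), via a blending parameter alpha.
from itertools import combinations

def tour_score(tour, prefs, alpha=0.5):
    """Blend overall happiness (sum) with fairness (min over members)."""
    per_member = [sum(p[poi] for poi in tour) for p in prefs]
    return alpha * sum(per_member) + (1 - alpha) * min(per_member)

def best_tour(pois, lengths, prefs, budget, alpha=0.5):
    """Brute-force search over POI subsets that fit the length budget."""
    best, best_val = (), float("-inf")
    for r in range(1, len(pois) + 1):
        for tour in combinations(pois, r):
            if sum(lengths[poi] for poi in tour) > budget:
                continue
            val = tour_score(tour, prefs, alpha)
            if val > best_val:
                best, best_val = tour, val
    return best

pois = ["museum", "park", "tower"]
lengths = {"museum": 2, "park": 1, "tower": 3}
prefs = [{"museum": 5, "park": 1, "tower": 4},   # member 1
         {"museum": 1, "park": 5, "tower": 4}]   # member 2
print(best_tour(pois, lengths, prefs, budget=4))
```

Raising alpha favours total happiness; lowering it protects the least-satisfied member, which is exactly the tension between objectives the abstract describes. Brute force is only viable for tiny instances, motivating the complexity analysis and practical algorithms the paper develops.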
Community Membership Hiding as Counterfactual Graph Search via Deep Reinforcement Learning
Community detection techniques are useful tools for social media platforms to
discover tightly connected groups of users who share common interests. However,
this functionality often comes at the expense of potentially exposing
individuals to privacy breaches by inadvertently revealing their tastes or
preferences. Therefore, some users may wish to safeguard their anonymity and
opt out of community detection for various reasons, such as affiliation with
political or religious organizations.
In this study, we address the challenge of community membership hiding, which
involves strategically altering the structural properties of a network graph to
prevent one or more nodes from being identified by a given community detection
algorithm. We tackle this problem by formulating it as a constrained
counterfactual graph objective, and we solve it via deep reinforcement
learning. We validate the effectiveness of our method through two distinct
tasks: node and community deception. Extensive experiments show that our
approach overall outperforms existing baselines in both tasks.
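The counterfactual search loop at the heart of membership hiding can be sketched as follows. This is a hedged illustration: the connected-components "detector" and the greedy edge-removal policy below are stand-ins chosen so the sketch is self-contained, whereas the paper plugs in a real community detection algorithm and learns which edits to make with deep reinforcement learning.

```python
# Hedged sketch of community membership hiding: alter the target node's
# edges, within a budget, until a (pluggable) community detection
# routine no longer places it with its original community.

def communities_by_component(adj):
    """Toy detector: communities = connected components of the graph."""
    seen, comms = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comms.append(comp)
    return comms

def hide_node(adj, target, budget, detect=communities_by_component):
    """Remove up to `budget` of target's edges until its community changes."""
    original = next(c for c in detect(adj) if target in c)
    edits = 0
    for neigh in sorted(adj[target]):
        if edits >= budget:
            break
        adj[target].discard(neigh)   # counterfactual edit: drop one edge
        adj[neigh].discard(target)
        edits += 1
        current = next(c for c in detect(adj) if target in c)
        if current != original:
            return True
    return False

# Triangle {a, b, c}: hide "c" from its community within two edits.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(hide_node(adj, "c", budget=2))
```

The budget constraint mirrors the "constrained" part of the paper's counterfactual objective: the rewiring must stay small, so the rest of the graph (and the communities of other users) is disturbed as little as possible.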
Misspelling Oblivious Word Embeddings
In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method that combines FastText's subword modeling with a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset that we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.
Comment: 9 pages
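Why subword modeling helps with misspellings can be shown with a short sketch. The n-gram extraction mirrors FastText's "<word>" boundary convention; the Jaccard overlap measure and the example words are illustrative assumptions, not the paper's evaluation.

```python
# Hedged sketch of why character n-gram (subword) representations help
# with misspellings: a word and its misspelling share most n-grams, so
# their embeddings (sums of n-gram vectors in FastText-style models)
# start out close even before any supervised misspelling training.

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams of a word with FastText-style boundaries."""
    token = f"<{word}>"
    return {token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)}

def overlap(w1, w2):
    """Jaccard overlap between the subword sets of two spellings."""
    a, b = char_ngrams(w1), char_ngrams(w2)
    return len(a & b) / len(a | b)

# A misspelling shares far more subwords with its correct form than
# an unrelated word does.
print(overlap("beautiful", "beutiful"))
print(overlap("beautiful", "keyboard"))
```

The supervised task described in the abstract then goes further, explicitly pulling each misspelling's embedding toward its correct variant rather than relying on subword overlap alone.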
Sheaf Neural Networks for Graph-based Recommender Systems
Recent progress in Graph Neural Networks has resulted in wide adoption by many applications, including recommendation systems. The reason for Graph Neural Networks' superiority over other approaches is that many problems in recommendation systems can be naturally modeled as graphs, where nodes can be either users or items and edges represent preference relationships. In current Graph Neural Network approaches, nodes are represented with a static vector learned at training time. This static vector might only be suitable to capture some of the nuances of the users or items it describes. To overcome this limitation, we propose using a recently proposed model inspired by category theory: Sheaf Neural Networks. Sheaf Neural Networks, and their associated sheaf Laplacian, can address the previous problem by associating every node (and edge) with a vector space instead of a single vector. The vector space representation is richer and allows picking the proper representation at inference time. This approach can be generalized to different related tasks on graphs and achieves state-of-the-art performance in terms of F1-Score@N in collaborative filtering and Hits@20 in link prediction. For collaborative filtering, the approach is evaluated on MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a 5.4% improvement, and on Book-Crossing with a 2.8% improvement; for link prediction, it is evaluated on the ogbl-ddi dataset with a 1.6% improvement over the respective baselines.
Comment: 9 pages, 7 figures
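What "a vector space per node" and a sheaf Laplacian mean can be made concrete on a toy graph. This is a hedged numerical sketch, not the paper's model: the stalk dimension and the restriction maps below are fixed by hand for illustration, whereas Sheaf Neural Networks learn them from data.

```python
# Hedged sketch of the sheaf Laplacian on a toy 2-node, 1-edge graph.
# Each node and edge carries a stalk R^d; restriction maps move node
# data into edge stalks, and the Laplacian L = delta^T delta measures
# disagreement across edges after restriction.
import numpy as np

d = 2                                    # stalk dimension
# Edge (u, v) with restriction maps F_u: stalk(u) -> stalk(e), F_v.
F_u = np.array([[1.0, 0.0], [0.0, 1.0]])
F_v = np.array([[0.0, 1.0], [1.0, 0.0]])

# Coboundary over the single edge: (delta x)_e = F_u x_u - F_v x_v.
delta = np.hstack([F_u, -F_v])           # shape (d, 2*d)
L = delta.T @ delta                      # sheaf Laplacian, shape (2*d, 2*d)

# x is "consistent" across the edge iff F_u x_u == F_v x_v, i.e. L x = 0.
x = np.array([1.0, 2.0, 2.0, 1.0])       # x_u = (1, 2), x_v = (2, 1)
print(L @ x)                             # zero vector: a global section
```

With identity restriction maps this reduces to the ordinary graph Laplacian acting block-wise; it is the learnable, non-identity maps that give each node's vector space its own "point of view" on shared edges.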